Overview

Dataset statistics

Number of variables: 9
Number of observations: 891221
Missing cells: 823013
Missing cells (%): 10.3%
Duplicate rows: 80793
Duplicate rows (%): 9.1%
Total size in memory: 107.5 MiB
Average record size in memory: 126.4 B
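
The two percentages above follow from the reported counts. A minimal sketch of the arithmetic, assuming that a "cell" is one value of one variable for one observation, so the table holds variables × observations cells (the variable names are illustrative only):

```python
# Counts copied from the dataset statistics above.
n_variables = 9
n_observations = 891221
missing_cells = 823013
duplicate_rows = 80793

# Assumption: total cells = number of variables x number of observations.
total_cells = n_variables * n_observations

missing_pct = 100 * missing_cells / total_cells
duplicate_pct = 100 * duplicate_rows / n_observations

print(f"Missing cells (%): {missing_pct:.1f}%")
print(f"Duplicate rows (%): {duplicate_pct:.1f}%")
```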

Variable types

Numeric: 8
Categorical: 1

Warnings

Dataset has 80793 (9.1%) duplicate rows [Duplicates]
ANZ_HAUSHALTE_AKTIV is highly correlated with ANZ_HH_TITEL [High correlation]
KBA05_HERSTTEMP is highly correlated with MIN_GEBAEUDEJAHR and 1 other field [High correlation]
ANZ_HH_TITEL is highly correlated with ANZ_HAUSHALTE_AKTIV [High correlation]
MIN_GEBAEUDEJAHR is highly correlated with KBA05_HERSTTEMP and 1 other field [High correlation]
KBA05_MODTEMP is highly correlated with KBA05_HERSTTEMP and 1 other field [High correlation]
ANZ_HAUSHALTE_AKTIV has 93148 (10.5%) missing values [Missing]
ANZ_HH_TITEL has 97008 (10.9%) missing values [Missing]
GEBAEUDETYP has 93148 (10.5%) missing values [Missing]
KBA05_HERSTTEMP has 93148 (10.5%) missing values [Missing]
KBA05_MODTEMP has 93148 (10.5%) missing values [Missing]
KONSUMNAEHE has 73969 (8.3%) missing values [Missing]
MIN_GEBAEUDEJAHR has 93148 (10.5%) missing values [Missing]
OST_WEST_KZ has 93148 (10.5%) missing values [Missing]
WOHNLAGE has 93148 (10.5%) missing values [Missing]
ANZ_HH_TITEL is highly skewed (γ1 = 22.71869357) [Skewed]
ANZ_HH_TITEL has 770244 (86.4%) zeros [Zeros]

Reproduction

Analysis started: 2021-05-17 07:27:16.351740
Analysis finished: 2021-05-17 07:28:07.622347
Duration: 51.27 seconds
Software version: pandas-profiling v3.0.0
Download configuration: config.json
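
The reported duration is the difference between the two timestamps. A small sketch using only the standard library, with the timestamps copied from the lines above:

```python
from datetime import datetime

# Timestamps copied from the reproduction details above.
started = datetime.fromisoformat("2021-05-17 07:27:16.351740")
finished = datetime.fromisoformat("2021-05-17 07:28:07.622347")

duration_seconds = (finished - started).total_seconds()
print(f"Duration: {duration_seconds:.2f} seconds")
```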

Variables

ANZ_HAUSHALTE_AKTIV
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct: 292
Distinct (%): < 0.1%
Missing: 93148
Missing (%): 10.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 8.287263195
Minimum: 0
Maximum: 595
Zeros: 6463
Zeros (%): 0.7%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 0
5-th percentile: 1
Q1: 1
median: 4
Q3: 9
95-th percentile: 28
Maximum: 595
Range: 595
Interquartile range (IQR): 8

Descriptive statistics

Standard deviation: 15.62808702
Coefficient of variation (CV): 1.885795907
Kurtosis: 142.6179559
Mean: 8.287263195
Median Absolute Deviation (MAD): 3
Skewness: 8.779951735
Sum: 6613841
Variance: 244.2371038
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=50)

Value | Count | Frequency (%)
1 | 195957 | 22.0%
2 | 120982 | 13.6%
3 | 62575 | 7.0%
4 | 43213 | 4.8%
5 | 37815 | 4.2%
6 | 36020 | 4.0%
7 | 34526 | 3.9%
8 | 32293 | 3.6%
9 | 29002 | 3.3%
10 | 25428 | 2.9%
Other values (282) | 180262 | 20.2%
(Missing) | 93148 | 10.5%

Value | Count | Frequency (%)
0 | 6463 | 0.7%
1 | 195957 | 22.0%
2 | 120982 | 13.6%
3 | 62575 | 7.0%
4 | 43213 | 4.8%
5 | 37815 | 4.2%
6 | 36020 | 4.0%
7 | 34526 | 3.9%
8 | 32293 | 3.6%
9 | 29002 | 3.3%

Value | Count | Frequency (%)
595 | 8 | < 0.1%
536 | 1 | < 0.1%
523 | 4 | < 0.1%
515 | 4 | < 0.1%
445 | 7 | < 0.1%
438 | 9 | < 0.1%
430 | 6 | < 0.1%
414 | 3 | < 0.1%
404 | 2 | < 0.1%
395 | 3 | < 0.1%

ANZ_HH_TITEL
Real number (ℝ≥0)

HIGH CORRELATION
MISSING
SKEWED
ZEROS

Distinct: 21
Distinct (%): < 0.1%
Missing: 97008
Missing (%): 10.9%
Infinite: 0
Infinite (%): 0.0%
Mean: 0.04064652681
Minimum: 0
Maximum: 23
Zeros: 770244
Zeros (%): 86.4%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 0
5-th percentile: 0
Q1: 0
median: 0
Q3: 0
95-th percentile: 0
Maximum: 23
Range: 23
Interquartile range (IQR): 0

Descriptive statistics

Standard deviation: 0.3240284646
Coefficient of variation (CV): 7.971861066
Kurtosis: 894.8859324
Mean: 0.04064652681
Median Absolute Deviation (MAD): 0
Skewness: 22.71869357
Sum: 32282
Variance: 0.1049944458
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=21)

Value | Count | Frequency (%)
0 | 770244 | 86.4%
1 | 20157 | 2.3%
2 | 2459 | 0.3%
3 | 585 | 0.1%
4 | 232 | < 0.1%
5 | 117 | < 0.1%
6 | 106 | < 0.1%
8 | 68 | < 0.1%
7 | 65 | < 0.1%
9 | 34 | < 0.1%
Other values (11) | 146 | < 0.1%
(Missing) | 97008 | 10.9%

Value | Count | Frequency (%)
0 | 770244 | 86.4%
1 | 20157 | 2.3%
2 | 2459 | 0.3%
3 | 585 | 0.1%
4 | 232 | < 0.1%
5 | 117 | < 0.1%
6 | 106 | < 0.1%
7 | 65 | < 0.1%
8 | 68 | < 0.1%
9 | 34 | < 0.1%

Value | Count | Frequency (%)
23 | 3 | < 0.1%
20 | 9 | < 0.1%
18 | 6 | < 0.1%
17 | 13 | < 0.1%
16 | 3 | < 0.1%
15 | 7 | < 0.1%
14 | 16 | < 0.1%
13 | 29 | < 0.1%
12 | 22 | < 0.1%
11 | 22 | < 0.1%

GEBAEUDETYP
Real number (ℝ≥0)

MISSING

Distinct: 7
Distinct (%): < 0.1%
Missing: 93148
Missing (%): 10.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.798641227
Minimum: 1
Maximum: 8
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 1
median: 1
Q3: 3
95-th percentile: 8
Maximum: 8
Range: 7
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 2.65671341
Coefficient of variation (CV): 0.949286884
Kurtosis: -0.06997518803
Mean: 2.798641227
Median Absolute Deviation (MAD): 0
Skewness: 1.256130407
Sum: 2233520
Variance: 7.058126142
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=7)

Value | Count | Frequency (%)
1 | 460465 | 51.7%
3 | 178668 | 20.0%
8 | 152476 | 17.1%
2 | 4935 | 0.6%
4 | 900 | 0.1%
6 | 628 | 0.1%
5 | 1 | < 0.1%
(Missing) | 93148 | 10.5%

Value | Count | Frequency (%)
1 | 460465 | 51.7%
2 | 4935 | 0.6%
3 | 178668 | 20.0%
4 | 900 | 0.1%
5 | 1 | < 0.1%
6 | 628 | 0.1%
8 | 152476 | 17.1%

Value | Count | Frequency (%)
8 | 152476 | 17.1%
6 | 628 | 0.1%
5 | 1 | < 0.1%
4 | 900 | 0.1%
3 | 178668 | 20.0%
2 | 4935 | 0.6%
1 | 460465 | 51.7%
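
Because GEBAEUDETYP takes only seven distinct values, its descriptive statistics can be recomputed from the frequency table alone. The sketch below does so with plain Python. It uses population moments for skewness and kurtosis, whereas the report is assumed to use the bias-corrected estimators from pandas; with roughly 800,000 observations the two should agree to several decimal places.

```python
import math

# Value -> count for GEBAEUDETYP, copied from the frequency table above
# (missing values excluded).
counts = {1: 460465, 2: 4935, 3: 178668, 4: 900, 5: 1, 6: 628, 8: 152476}

n = sum(counts.values())
total = sum(value * count for value, count in counts.items())
mean = total / n


def central_moment(k):
    """Population central moment of order k, weighted by the counts."""
    return sum(count * (value - mean) ** k for value, count in counts.items()) / n


m2 = central_moment(2)
variance = m2 * n / (n - 1)  # sample variance (ddof=1), an assumption about the report
std = math.sqrt(variance)
cv = std / mean  # coefficient of variation
skewness = central_moment(3) / m2 ** 1.5  # population skewness
excess_kurtosis = central_moment(4) / m2 ** 2 - 3  # population excess kurtosis

print(f"n={n} sum={total} mean={mean:.6f}")
print(f"variance={variance:.4f} std={std:.4f} cv={cv:.4f}")
print(f"skewness={skewness:.4f} excess kurtosis={excess_kurtosis:.4f}")
```

The negative kurtosis value in the report suggests it is an excess kurtosis (a normal distribution scores 0), which is why 3 is subtracted here.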

KBA05_HERSTTEMP
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct: 6
Distinct (%): < 0.1%
Missing: 93148
Missing (%): 10.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 2.836532498
Minimum: 1
Maximum: 9
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 3
Q3: 4
95-th percentile: 5
Maximum: 9
Range: 8
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 1.491578283
Coefficient of variation (CV): 0.5258456527
Kurtosis: 4.075225324
Mean: 2.836532498
Median Absolute Deviation (MAD): 1
Skewness: 1.40012902
Sum: 2263760
Variance: 2.224805773
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=6)

Value | Count | Frequency (%)
3 | 275428 | 30.9%
1 | 162386 | 18.2%
2 | 157856 | 17.7%
4 | 120193 | 13.5%
5 | 65321 | 7.3%
9 | 16889 | 1.9%
(Missing) | 93148 | 10.5%

Value | Count | Frequency (%)
1 | 162386 | 18.2%
2 | 157856 | 17.7%
3 | 275428 | 30.9%
4 | 120193 | 13.5%
5 | 65321 | 7.3%
9 | 16889 | 1.9%

Value | Count | Frequency (%)
9 | 16889 | 1.9%
5 | 65321 | 7.3%
4 | 120193 | 13.5%
3 | 275428 | 30.9%
2 | 157856 | 17.7%
1 | 162386 | 18.2%

KBA05_MODTEMP
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct: 6
Distinct (%): < 0.1%
Missing: 93148
Missing (%): 10.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.006466827
Minimum: 1
Maximum: 6
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 3
Q3: 4
95-th percentile: 5
Maximum: 6
Range: 5
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 1.255616012
Coefficient of variation (CV): 0.4176384054
Kurtosis: -0.7025907215
Mean: 3.006466827
Median Absolute Deviation (MAD): 1
Skewness: -0.1950516299
Sum: 2399380
Variance: 1.576571568
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=6)

Value | Count | Frequency (%)
3 | 267178 | 30.0%
4 | 226782 | 25.4%
1 | 151667 | 17.0%
2 | 77576 | 8.7%
5 | 65321 | 7.3%
6 | 9549 | 1.1%
(Missing) | 93148 | 10.5%

Value | Count | Frequency (%)
1 | 151667 | 17.0%
2 | 77576 | 8.7%
3 | 267178 | 30.0%
4 | 226782 | 25.4%
5 | 65321 | 7.3%
6 | 9549 | 1.1%

Value | Count | Frequency (%)
6 | 9549 | 1.1%
5 | 65321 | 7.3%
4 | 226782 | 25.4%
3 | 267178 | 30.0%
2 | 77576 | 8.7%
1 | 151667 | 17.0%

KONSUMNAEHE
Real number (ℝ≥0)

MISSING

Distinct: 7
Distinct (%): < 0.1%
Missing: 73969
Missing (%): 8.3%
Infinite: 0
Infinite (%): 0.0%
Mean: 3.018452081
Minimum: 1
Maximum: 7
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 1
5-th percentile: 1
Q1: 2
median: 3
Q3: 4
95-th percentile: 5
Maximum: 7
Range: 6
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 1.550311878
Coefficient of variation (CV): 0.5136115586
Kurtosis: -1.089230428
Mean: 3.018452081
Median Absolute Deviation (MAD): 1
Skewness: 0.1833736883
Sum: 2466836
Variance: 2.403466919
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=7)

Value | Count | Frequency (%)
1 | 193738 | 21.7%
3 | 171127 | 19.2%
5 | 153535 | 17.2%
2 | 134665 | 15.1%
4 | 133324 | 15.0%
6 | 26625 | 3.0%
7 | 4238 | 0.5%
(Missing) | 73969 | 8.3%

Value | Count | Frequency (%)
1 | 193738 | 21.7%
2 | 134665 | 15.1%
3 | 171127 | 19.2%
4 | 133324 | 15.0%
5 | 153535 | 17.2%
6 | 26625 | 3.0%
7 | 4238 | 0.5%

Value | Count | Frequency (%)
7 | 4238 | 0.5%
6 | 26625 | 3.0%
5 | 153535 | 17.2%
4 | 133324 | 15.0%
3 | 171127 | 19.2%
2 | 134665 | 15.1%
1 | 193738 | 21.7%

MIN_GEBAEUDEJAHR
Real number (ℝ≥0)

HIGH CORRELATION
MISSING

Distinct: 32
Distinct (%): < 0.1%
Missing: 93148
Missing (%): 10.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 1993.277011
Minimum: 1985
Maximum: 2016
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 1985
5-th percentile: 1992
Q1: 1992
median: 1992
Q3: 1993
95-th percentile: 2000
Maximum: 2016
Range: 31
Interquartile range (IQR): 1

Descriptive statistics

Standard deviation: 3.332738997
Coefficient of variation (CV): 0.001671989883
Kurtosis: 14.22772176
Mean: 1993.277011
Median Absolute Deviation (MAD): 0
Skewness: 3.556034362
Sum: 1590780564
Variance: 11.10714922
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=32)

Value | Count | Frequency (%)
1992 | 568776 | 63.8%
1994 | 78835 | 8.8%
1993 | 25488 | 2.9%
1995 | 25464 | 2.9%
1996 | 16611 | 1.9%
1997 | 14464 | 1.6%
2000 | 7382 | 0.8%
2001 | 5877 | 0.7%
1991 | 5811 | 0.7%
2005 | 5553 | 0.6%
Other values (22) | 43812 | 4.9%
(Missing) | 93148 | 10.5%

Value | Count | Frequency (%)
1985 | 116 | < 0.1%
1986 | 125 | < 0.1%
1987 | 470 | 0.1%
1988 | 1027 | 0.1%
1989 | 2046 | 0.2%
1990 | 4408 | 0.5%
1991 | 5811 | 0.7%
1992 | 568776 | 63.8%
1993 | 25488 | 2.9%
1994 | 78835 | 8.8%

Value | Count | Frequency (%)
2016 | 128 | < 0.1%
2015 | 717 | 0.1%
2014 | 1001 | 0.1%
2013 | 1230 | 0.1%
2012 | 1861 | 0.2%
2011 | 1903 | 0.2%
2010 | 1410 | 0.2%
2009 | 2016 | 0.2%
2008 | 2197 | 0.2%
2007 | 2156 | 0.2%

OST_WEST_KZ
Categorical

MISSING

Distinct: 2
Distinct (%): < 0.1%
Missing: 93148
Missing (%): 10.5%
Memory size: 53.1 MiB
W: 629528
O: 168545

Length

Max length: 1
Median length: 1
Mean length: 1
Min length: 1

Characters and Unicode

Total characters: 798073
Distinct characters: 2
Distinct categories: 1
Distinct scripts: 1
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
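
As an illustration of such character properties, Python's standard `unicodedata` module exposes the general category of each code point; "Lu" is the code for Uppercase Letter. The values below are a hypothetical sample, not the real column, and scripts and blocks are not covered because the standard library does not expose them.

```python
import unicodedata
from collections import Counter

# Hypothetical sample of OST_WEST_KZ values (not the real column).
values = ["W", "W", "O", "W", "O"]

# Count each character and each Unicode general category across all values.
char_counts = Counter(ch for value in values for ch in value)
category_counts = Counter(
    unicodedata.category(ch) for value in values for ch in value
)

print(char_counts)
print(category_counts)
```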

Unique

Unique: 0
Unique (%): 0.0%

Sample

1st row: W
2nd row: W
3rd row: W
4th row: W
5th row: W

Common Values

Value | Count | Frequency (%)
W | 629528 | 70.6%
O | 168545 | 18.9%
(Missing) | 93148 | 10.5%

Length

Histogram of lengths of the category

Pie chart

Value | Count | Frequency (%)
w | 629528 | 78.9%
o | 168545 | 21.1%

Most occurring characters

Value | Count | Frequency (%)
W | 629528 | 78.9%
O | 168545 | 21.1%

Most occurring categories

Value | Count | Frequency (%)
Uppercase Letter | 798073 | 100.0%

Most frequent character per category

Uppercase Letter
Value | Count | Frequency (%)
W | 629528 | 78.9%
O | 168545 | 21.1%

Most occurring scripts

Value | Count | Frequency (%)
Latin | 798073 | 100.0%

Most frequent character per script

Latin
Value | Count | Frequency (%)
W | 629528 | 78.9%
O | 168545 | 21.1%

Most occurring blocks

Value | Count | Frequency (%)
ASCII | 798073 | 100.0%

Most frequent character per block

ASCII
Value | Count | Frequency (%)
W | 629528 | 78.9%
O | 168545 | 21.1%

WOHNLAGE
Real number (ℝ≥0)

MISSING

Distinct: 8
Distinct (%): < 0.1%
Missing: 93148
Missing (%): 10.5%
Infinite: 0
Infinite (%): 0.0%
Mean: 4.052836019
Minimum: 0
Maximum: 8
Zeros: 6950
Zeros (%): 0.8%
Negative: 0
Negative (%): 0.0%
Memory size: 6.8 MiB

Quantile statistics

Minimum: 0
5-th percentile: 1
Q1: 3
median: 3
Q3: 5
95-th percentile: 7
Maximum: 8
Range: 8
Interquartile range (IQR): 2

Descriptive statistics

Standard deviation: 1.949538668
Coefficient of variation (CV): 0.4810307299
Kurtosis: -0.8440882651
Mean: 4.052836019
Median Absolute Deviation (MAD): 1
Skewness: 0.4396769473
Sum: 3234459
Variance: 3.800701019
Monotonicity: Not monotonic
Histogram with fixed size bins (bins=8)

Value | Count | Frequency (%)
3 | 249719 | 28.0%
7 | 169318 | 19.0%
4 | 135973 | 15.3%
2 | 100376 | 11.3%
5 | 74346 | 8.3%
1 | 43918 | 4.9%
8 | 17473 | 2.0%
0 | 6950 | 0.8%
(Missing) | 93148 | 10.5%

Value | Count | Frequency (%)
0 | 6950 | 0.8%
1 | 43918 | 4.9%
2 | 100376 | 11.3%
3 | 249719 | 28.0%
4 | 135973 | 15.3%
5 | 74346 | 8.3%
7 | 169318 | 19.0%
8 | 17473 | 2.0%

Value | Count | Frequency (%)
8 | 17473 | 2.0%
7 | 169318 | 19.0%
5 | 74346 | 8.3%
4 | 135973 | 15.3%
3 | 249719 | 28.0%
2 | 100376 | 11.3%
1 | 43918 | 4.9%
0 | 6950 | 0.8%
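
The quantile statistics for WOHNLAGE can be recovered from its frequency table by walking the cumulative counts. This is a sketch under the assumption that quantiles are interpolated linearly between neighbouring order statistics (the pandas default); the helper names `value_at` and `quantile` are illustrative.

```python
import math

# Value -> count for WOHNLAGE, copied from the frequency table above
# (missing values excluded).
counts = {0: 6950, 1: 43918, 2: 100376, 3: 249719,
          4: 135973, 5: 74346, 7: 169318, 8: 17473}


def value_at(position):
    """Return the value at a zero-based position in the sorted data."""
    cumulative = 0
    for value in sorted(counts):
        cumulative += counts[value]
        if position < cumulative:
            return value
    raise IndexError("position is beyond the last observation")


def quantile(q):
    """Quantile with linear interpolation between neighbouring order statistics."""
    n = sum(counts.values())
    position = q * (n - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return value_at(lower) + fraction * (value_at(upper) - value_at(lower))


p5, q1, median, q3, p95 = (quantile(q) for q in (0.05, 0.25, 0.5, 0.75, 0.95))
iqr = q3 - q1  # interquartile range
value_range = quantile(1.0) - quantile(0.0)

print(p5, q1, median, q3, p95, iqr, value_range)
```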

Interactions

Correlations

Pearson's r

Pearson's correlation coefficient (r) is a measure of linear correlation between two variables. Its value lies between -1 and +1, with -1 indicating total negative linear correlation, 0 indicating no linear correlation and +1 indicating total positive linear correlation. Furthermore, r is invariant under separate changes in location and scale of the two variables, implying that for a linear function the angle to the x-axis does not affect r.

To calculate r for two variables X and Y, one divides the covariance of X and Y by the product of their standard deviations.
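
A minimal pure-Python sketch of this calculation, using made-up data. The 1/(n-1) factors in the covariance and the standard deviations cancel, so plain sums of deviations are used. The second call illustrates invariance under a change of location and (positive) scale.

```python
import math


def pearson_r(x, y):
    """Covariance of x and y divided by the product of their standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations; the common 1/(n-1) factor cancels out.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up, nearly linear data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]

r = pearson_r(x, y)
# Shift and rescale both variables separately; r should not change.
r_rescaled = pearson_r([10 * a + 3 for a in x], [0.5 * b - 7 for b in y])

print(round(r, 4), round(r_rescaled, 4))
```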

Spearman's ρ

Spearman's rank correlation coefficient (ρ) is a measure of monotonic correlation between two variables and is therefore better at catching nonlinear monotonic correlations than Pearson's r. Its value lies between -1 and +1, with -1 indicating total negative monotonic correlation, 0 indicating no monotonic correlation and +1 indicating total positive monotonic correlation.

To calculate ρ for two variables X and Y, one divides the covariance of the rank variables of X and Y by the product of their standard deviations.
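
A sketch of the rank-based calculation on made-up data: each variable is replaced by its ranks (tied values share the average rank, a common convention assumed here) and the covariance-over-standard-deviations formula is applied to the ranks. The cubic relationship below is monotonic but not linear, so ρ reaches 1 while r on the raw values does not.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def correlation(x, y):
    """Covariance divided by the product of the standard deviations."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Correlation of the rank variables of x and y."""
    return correlation(average_ranks(x), average_ranks(y))


# Made-up data: y is a monotonic but nonlinear function of x.
x = [1, 2, 3, 4, 5]
y = [a ** 3 for a in x]

rho = spearman_rho(x, y)
r = correlation(x, y)
print(round(rho, 4), round(r, 4))
```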

Kendall's τ

Similarly to Spearman's rank correlation coefficient, the Kendall rank correlation coefficient (τ) measures ordinal association between two variables. Its value lies between -1 and +1, with -1 indicating total negative correlation, 0 indicating no correlation and +1 indicating total positive correlation.

To calculate τ for two variables X and Y, one determines the number of concordant and discordant pairs of observations. τ is given by the number of concordant pairs minus the discordant pairs divided by the total number of pairs.
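
The pair-counting definition above can be sketched directly. This is the variant often called tau-a: tied pairs count as neither concordant nor discordant. Library implementations commonly apply a tie correction (tau-b), so results on data with ties may differ from this sketch. The data are made up.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """(concordant pairs - discordant pairs) / total number of pairs."""
    concordant = discordant = 0
    pairs = list(combinations(range(len(x)), 2))
    for i, j in pairs:
        # A pair is concordant when x and y move in the same direction.
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)


# Made-up data: one swapped pair out of six.
x = [1, 2, 3, 4]
y = [1, 3, 2, 4]

tau = kendall_tau_a(x, y)
print(round(tau, 4))
```

Here five of the six pairs are concordant and one is discordant, giving τ = (5 - 1) / 6.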

Phik (φk)

Phik (φk) is a new and practical correlation coefficient that works consistently between categorical, ordinal and interval variables, captures non-linear dependency and reverts to the Pearson correlation coefficient in the case of a bivariate normal input distribution. Extensive documentation is available for Phik.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
The correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another.
The dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap.
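
Nullity correlation can be sketched as the correlation between per-column missingness indicators (1 where a value is missing, 0 otherwise). The rows below are hypothetical and only illustrate the idea. In this report seven variables each have exactly 93148 missing values, which would be consistent with them being missing in the same rows, although the counts alone do not confirm that.

```python
import math

# Hypothetical rows; None marks a missing value.
rows = [
    {"GEBAEUDETYP": 1.0, "WOHNLAGE": 3.0, "KONSUMNAEHE": 2.0},
    {"GEBAEUDETYP": None, "WOHNLAGE": None, "KONSUMNAEHE": None},
    {"GEBAEUDETYP": 8.0, "WOHNLAGE": 7.0, "KONSUMNAEHE": None},
    {"GEBAEUDETYP": None, "WOHNLAGE": None, "KONSUMNAEHE": 5.0},
    {"GEBAEUDETYP": 3.0, "WOHNLAGE": 4.0, "KONSUMNAEHE": 1.0},
    {"GEBAEUDETYP": 1.0, "WOHNLAGE": 2.0, "KONSUMNAEHE": 4.0},
]


def null_mask(column):
    """1.0 where the column is missing, 0.0 where it is present."""
    return [1.0 if row[column] is None else 0.0 for row in rows]


def correlation(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def nullity_correlation(column_a, column_b):
    """Correlation between the missingness indicators of two columns."""
    return correlation(null_mask(column_a), null_mask(column_b))


same_rows = nullity_correlation("GEBAEUDETYP", "WOHNLAGE")
partial = nullity_correlation("GEBAEUDETYP", "KONSUMNAEHE")
print(round(same_rows, 4), round(partial, 4))
```

A value of 1 means the two columns are always missing together; values near 0 mean the missingness of one says little about the other.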

Sample

First rows

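Index | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KBA05_HERSTTEMP | KBA05_MODTEMP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | OST_WEST_KZ | WOHNLAGE
0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN
1 | 11.0 | 0.0 | 8.0 | 4.0 | 1.0 | 1.0 | 1992.0 | W | 4.0
2 | 10.0 | 0.0 | 1.0 | 4.0 | 4.0 | 5.0 | 1992.0 | W | 2.0
3 | 1.0 | 0.0 | 1.0 | 3.0 | 3.0 | 4.0 | 1997.0 | W | 7.0
4 | 3.0 | 0.0 | 1.0 | 3.0 | 3.0 | 4.0 | 1992.0 | W | 3.0
5 | 5.0 | 0.0 | 1.0 | 4.0 | 3.0 | 5.0 | 1992.0 | W | 7.0
6 | 4.0 | 0.0 | 1.0 | 1.0 | 4.0 | 5.0 | 1992.0 | W | 5.0
7 | 6.0 | 0.0 | 8.0 | 3.0 | 3.0 | 3.0 | 1992.0 | W | 1.0
8 | 2.0 | 1.0 | 3.0 | 1.0 | 3.0 | 4.0 | 1992.0 | W | 1.0
9 | 9.0 | 0.0 | 3.0 | 2.0 | 4.0 | 4.0 | 1992.0 | W | 7.0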

Last rows

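Index | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KBA05_HERSTTEMP | KBA05_MODTEMP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | OST_WEST_KZ | WOHNLAGE
891211 | 6.0 | 0.0 | 1.0 | 4.0 | 4.0 | 2.0 | 1992.0 | W | 3.0
891212 | 13.0 | 0.0 | 3.0 | 4.0 | 4.0 | 1.0 | 1992.0 | W | 3.0
891213 | 4.0 | 0.0 | 3.0 | 1.0 | 1.0 | 4.0 | 1992.0 | W | 4.0
891214 | 6.0 | 0.0 | 1.0 | 2.0 | 1.0 | 2.0 | 1992.0 | W | 3.0
891215 | 8.0 | 0.0 | 1.0 | 4.0 | 3.0 | 2.0 | 1992.0 | W | 3.0
891216 | 15.0 | 0.0 | 8.0 | 2.0 | 1.0 | 3.0 | 1992.0 | W | 3.0
891217 | 11.0 | 0.0 | 8.0 | 4.0 | 4.0 | 1.0 | 1992.0 | W | 5.0
891218 | 3.0 | 0.0 | 8.0 | 1.0 | 3.0 | 6.0 | 1992.0 | W | 7.0
891219 | 7.0 | 0.0 | 8.0 | 3.0 | 3.0 | 2.0 | 1992.0 | W | 5.0
891220 | 10.0 | 0.0 | 8.0 | 3.0 | 4.0 | 3.0 | 1992.0 | W | 4.0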

Duplicate rows

Most frequently occurring

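Index | ANZ_HAUSHALTE_AKTIV | ANZ_HH_TITEL | GEBAEUDETYP | KBA05_HERSTTEMP | KBA05_MODTEMP | KONSUMNAEHE | MIN_GEBAEUDEJAHR | OST_WEST_KZ | WOHNLAGE | # duplicates
2144 | 1.0 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 | 1992.0 | W | 7.0 | 1738
3316 | 1.0 | 0.0 | 1.0 | 3.0 | 3.0 | 5.0 | 1992.0 | W | 7.0 | 1190
2436 | 1.0 | 0.0 | 1.0 | 2.0 | 4.0 | 5.0 | 1992.0 | W | 7.0 | 1163
3727 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 5.0 | 1992.0 | W | 7.0 | 1102
2141 | 1.0 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 | 1992.0 | W | 3.0 | 946
12162 | 2.0 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 | 1992.0 | W | 7.0 | 888
1120 | 1.0 | 0.0 | 1.0 | 1.0 | 3.0 | 5.0 | 1992.0 | W | 7.0 | 854
3313 | 1.0 | 0.0 | 1.0 | 3.0 | 3.0 | 5.0 | 1992.0 | W | 3.0 | 840
3724 | 1.0 | 0.0 | 1.0 | 3.0 | 4.0 | 5.0 | 1992.0 | W | 3.0 | 795
3310 | 1.0 | 0.0 | 1.0 | 3.0 | 3.0 | 5.0 | 1992.0 | O | 7.0 | 772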
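
A table like the one above can be approximated by counting identical rows and keeping those that occur more than once. The sketch below uses hypothetical, shortened rows and assumes that "# duplicates" is the number of occurrences of each repeated row; the report does not state how the figure is derived.

```python
from collections import Counter

# Hypothetical rows (shortened to five columns for readability).
rows = [
    (1.0, 0.0, 1.0, "W", 7.0),
    (1.0, 0.0, 1.0, "W", 7.0),
    (2.0, 0.0, 1.0, "W", 7.0),
    (1.0, 0.0, 1.0, "W", 7.0),
    (1.0, 0.0, 1.0, "O", 7.0),
    (2.0, 0.0, 1.0, "W", 7.0),
]

row_counts = Counter(rows)

# Keep only rows that occur more than once, most frequent first.
duplicates = [(row, count) for row, count in row_counts.most_common() if count > 1]

for row, count in duplicates:
    print(row, count)
```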